In baseball, pitchers manipulate their grip on the ball to affect the path, speed, and rotation of their pitches. In Major League Baseball, there are a number of different pitches, each with its own defining characteristics. There are fastballs, which, as their name implies, are fast pitches; variations such as the four-seam fastball and the sinker differ in their velocity and movement. There are also breaking balls, which are slower than fastballs and move laterally or downward; curveballs and sliders are two examples. In short, pitchers use a variety of grips to vary their pitches and deceive batters. For this project, I wanted to train a machine learning model on pitch data so that the model can identify the grip the pitcher used to throw the ball.
For this project, I am using data from MLB’s Statcast, a service that uses radar-based tracking to follow players and the ball. To get the data, I used the baseballr package, which you can find at https://billpetti.github.io/baseballr/. This package allows users to scrape baseball data from numerous sources, including Statcast data from https://baseballsavant.mlb.com.
I used the following packages for this project:
library(baseballr)
library(dplyr)
library(ggplot2)
library(caret)
library(tidyverse)
library(plotly)
The first thing I did in this project was load the data into my environment using the statcast_search function from the baseballr package. This function downloads Statcast data for every pitch between the specified start and end dates. I chose 09/26/2023 as the start date and 10/01/2023 as the end date, and stored the results in the pitches variable. After the pitch data was downloaded, I took a subset of the pitches dataset so that I was only looking at variables that would help me classify the kind of pitch grip used, and stored this subset in pitches_sub. The variables that I decided to include were:

* pitch_type (abbreviation indicating the pitch grip used)
* release_speed (in miles/hour)
* p_throws (character indicating the pitcher’s handedness)
* pfx_x, pfx_z (horizontal and vertical movement of the pitch in feet)
* vx0, vy0, vz0 (pitch velocity in feet per second in the x, y, and z dimensions)
* ax, ay, az (pitch acceleration in ft/s^2 in the x, y, and z dimensions)
* spin_axis (spin axis in the 2D plane, from 0 to 360 degrees)
pitches = statcast_search(start_date = "2023-09-26", end_date = "2023-10-01")
pitches_sub = subset(pitches, select = c(pitch_type, release_speed, p_throws, pfx_x, pfx_z, vx0, vy0, vz0, ax, ay, az, spin_axis))
Here is the head and dimension of pitches_sub:
head(pitches_sub)
dim(pitches_sub)
## [1] 25000 12
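Before cleaning, it can also help to see how the pitch types are distributed. Here is a quick sketch using dplyr's count (an aside, not part of the original pipeline); the exact counts will depend on the date range downloaded.

```r
# Quick look at the class balance of the target variable.
# Counts will vary with the date range queried.
pitches_sub %>%
  count(pitch_type, sort = TRUE)
```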
One problem is that left-handed and right-handed pitchers get mirrored movement on their pitches. For example, a slider thrown by a left-handed pitcher moves to the left (from the catcher’s perspective), while a slider thrown by a righty moves to the right. Here is a plot visualizing this:
sliders = pitches_sub %>%
filter(pitch_type == "SL") %>%
group_by(p_throws)
ggplot(sliders, aes(x = pfx_x, y = pfx_z, color = p_throws)) +
geom_point()
As you can see in the plot, the horizontal movement of sliders is essentially opposite for left-handed and right-handed pitchers. We would expect a similar difference for every variable in our dataset that represents horizontal movement of the ball. To address this, I multiplied pfx_x, vx0, and ax by -1 for pitches thrown by left-handed pitchers. For spin_axis, I set it equal to 360 - spin_axis for left-handed pitchers. This way, the pitch data is standardized across left-handed and right-handed pitchers.
#Standardizing data for right handed and left handed pitchers
pitches_sub = pitches_sub %>%
  mutate(spin_axis = case_when(p_throws == "L" ~ (360 - spin_axis),
                               p_throws == "R" ~ spin_axis),
         ax = case_when(p_throws == "L" ~ (-1 * ax),
                        p_throws == "R" ~ ax),
         vx0 = case_when(p_throws == "L" ~ (-1 * vx0),
                         p_throws == "R" ~ vx0),
         pfx_x = case_when(p_throws == "L" ~ (-1 * pfx_x),
                           p_throws == "R" ~ pfx_x))
Here is the plot of slider movement now that the data has been standardized:
sliders = pitches_sub %>%
filter(pitch_type == "SL") %>%
group_by(p_throws)
ggplot(sliders, aes(x = pfx_x, y = pfx_z, color = p_throws)) +
geom_point()
Now, the movement in the dataset is essentially the same for sliders thrown by left-handed and right-handed pitchers.
Now that we have standardized the data for left-handed and right-handed pitchers, we can remove the p_throws variable from the pitches_sub dataset, as it will not be needed for classifying pitch types. We can now look at a summary of the data and check for NA values.
#Remove p_throws column
pitches_sub = subset(pitches_sub, select = -c(p_throws))
summary(pitches_sub)
## pitch_type release_speed pfx_x pfx_z
## Length:25000 Min. : 39.90 Min. :-1.9600 Min. :-2.0200
## Class :character 1st Qu.: 84.70 1st Qu.:-1.0900 1st Qu.: 0.1400
## Mode :character Median : 89.90 Median :-0.5600 Median : 0.6400
## Mean : 89.08 Mean :-0.3727 Mean : 0.5894
## 3rd Qu.: 94.00 3rd Qu.: 0.2800 3rd Qu.: 1.1800
## Max. :102.70 Max. : 2.0300 Max. : 2.0600
## NA's :1 NA's :1 NA's :1
## vx0 vy0 vz0 ax
## Min. :-6.999 Min. :-149.15 Min. :-14.947 Min. :-27.302
## 1st Qu.: 3.725 1st Qu.:-136.63 1st Qu.: -5.743 1st Qu.:-14.026
## Median : 5.649 Median :-130.78 Median : -3.860 Median : -8.088
## Mean : 5.665 Mean :-129.56 Mean : -3.768 Mean : -6.104
## 3rd Qu.: 7.588 3rd Qu.:-123.21 3rd Qu.: -1.857 3rd Qu.: 2.084
## Max. :19.811 Max. : -57.48 Max. : 12.061 Max. : 19.651
## NA's :1 NA's :1 NA's :1 NA's :1
## ay az spin_axis
## Min. : 5.272 Min. :-50.770 Min. : 0.0
## 1st Qu.:23.956 1st Qu.:-30.310 1st Qu.:136.0
## Median :26.859 Median :-24.393 Median :211.0
## Mean :26.970 Mean :-24.149 Mean :179.6
## 3rd Qu.:29.974 3rd Qu.:-16.671 3rd Qu.:227.0
## Max. :41.311 Max. : -5.012 Max. :360.0
## NA's :1 NA's :1 NA's :2
#Getting all rows with NA values and printing them
narow = pitches_sub[!complete.cases(pitches_sub), ]
print(narow)
## ── MLB Baseball Savant Statcast Search data from baseballsavant.mlb.com ────────
## ℹ Data updated: 2023-10-31 22:09:26 MST
## # A tibble: 2 × 11
## pitch_type release_speed pfx_x pfx_z vx0 vy0 vz0 ax ay az
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "FF" 94.4 -0.71 1.41 7.01 -137. -4.32 -10.6 31.7 -13.6
## 2 "" NA NA NA NA NA NA NA NA NA
## # ℹ 1 more variable: spin_axis <dbl>
Here, we see that one row has NA values for all of the variables. This row cannot contribute to training a model, so we will remove it from pitches_sub. The other row only has an NA value for spin_axis; its remaining values could still be informative for pitch classification, so rather than dropping the row, we will impute the missing spin_axis with the median spin_axis.
#Delete row with release_speed == NA (removes row with all NA values)
pitches_sub = pitches_sub[!is.na(pitches_sub$release_speed),]
#Set spin_axis == median(spin_axis) for row with missing spin axis value
pitches_sub[is.na(pitches_sub$spin_axis),]$spin_axis = median(pitches_sub$spin_axis, na.rm = T)
summary(pitches_sub)
## pitch_type release_speed pfx_x pfx_z
## Length:24999 Min. : 39.90 Min. :-1.9600 Min. :-2.0200
## Class :character 1st Qu.: 84.70 1st Qu.:-1.0900 1st Qu.: 0.1400
## Mode :character Median : 89.90 Median :-0.5600 Median : 0.6400
## Mean : 89.08 Mean :-0.3727 Mean : 0.5894
## 3rd Qu.: 94.00 3rd Qu.: 0.2800 3rd Qu.: 1.1800
## Max. :102.70 Max. : 2.0300 Max. : 2.0600
## vx0 vy0 vz0 ax
## Min. :-6.999 Min. :-149.15 Min. :-14.947 Min. :-27.302
## 1st Qu.: 3.725 1st Qu.:-136.63 1st Qu.: -5.743 1st Qu.:-14.026
## Median : 5.649 Median :-130.78 Median : -3.860 Median : -8.088
## Mean : 5.665 Mean :-129.56 Mean : -3.768 Mean : -6.104
## 3rd Qu.: 7.588 3rd Qu.:-123.21 3rd Qu.: -1.857 3rd Qu.: 2.084
## Max. :19.811 Max. : -57.48 Max. : 12.061 Max. : 19.651
## ay az spin_axis
## Min. : 5.272 Min. :-50.770 Min. : 0.0
## 1st Qu.:23.956 1st Qu.:-30.310 1st Qu.:136.0
## Median :26.859 Median :-24.393 Median :211.0
## Mean :26.970 Mean :-24.149 Mean :179.6
## 3rd Qu.:29.974 3rd Qu.:-16.671 3rd Qu.:227.0
## Max. :41.311 Max. : -5.012 Max. :360.0
Let’s take an initial look at some of the variables and how they may relate to pitch type.
ptdata = pitches_sub %>%
group_by(pitch_type)
#Bar of average pitch speed by pitch type
spdplt = ggplot(ptdata) +
  geom_bar(aes(x = pitch_type, y = release_speed, fill = pitch_type),
           stat = "summary", fun = "mean") +
  ggtitle("Average Pitch Velocity by Pitch Type")
spdplt
#Scatterplot of pitch movement along x and z axes, grouped by pitch type
movplt = ggplot(ptdata, aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
geom_point() +
ggtitle("Pitch Movement Along X and Z axes, Grouped By Pitch Type")
movplt
#3D plot of pitch velocity in x, y, z dimensions, grouped by pitch type
v0plt <- plot_ly(ptdata, x = ~vx0, y = ~vz0, z = ~vy0, color = ~pitch_type)
v0plt <- v0plt %>% add_markers()
v0plt <- v0plt %>% layout(scene = list(xaxis = list(title = 'Velocity in x dimension'),
yaxis = list(title = 'Velocity in z dimension'),
zaxis = list(title = 'Velocity in y dimension')))
v0plt
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
#3D plot of pitch acceleration in x, y, z dimensions, grouped by pitch type
aplt <- plot_ly(ptdata, x = ~ax, y = ~az, z = ~ay, color = ~pitch_type)
aplt <- aplt %>% add_markers()
aplt <- aplt %>% layout(scene = list(xaxis = list(title = 'Acceleration in x dimension'),
                                     yaxis = list(title = 'Acceleration in z dimension'),
                                     zaxis = list(title = 'Acceleration in y dimension')))
aplt
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
#Bar of average spin axis by pitch type
spinplt = ggplot(ptdata) +
  geom_bar(aes(x = pitch_type, y = spin_axis, fill = pitch_type),
           stat = "summary", fun = "mean") +
  ggtitle("Average Spin Axis by Pitch Type")
spinplt
Now that we have removed all NA values and standardized the data for left- and right-handed pitchers, our data is clean. We can partition it into training and testing sets: 80% of the data will be used to train the models and the remaining 20% will be held out for testing. We can look at the dimensions of the training and testing sets.
set.seed(150)
split = createDataPartition(pitches_sub$pitch_type, times = 1, p = .8, list = F)
train = pitches_sub[split, ]
test = pitches_sub[-split, ]
dim(train)
## [1] 20006 11
dim(test)
## [1] 4993 11
To avoid overfitting our models to the training set, we will use k-fold cross-validation. This method splits the training data into a number of folds (specified by the number parameter), then repeatedly trains the model on all but one fold and evaluates it on the held-out fold, so that each fold serves as the validation set once. I set this whole process to repeat 3 times.
kfolds = trainControl(method="repeatedcv", number = 4, repeats = 3, verboseIter = F)
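To make the folding concrete, here is a small illustration (an aside, not part of the modeling pipeline) of how caret partitions the training rows for one round of 4-fold cross-validation; trainControl() above performs this internally for each of the 3 repeats.

```r
# Illustration: one round of 4-fold CV on the training labels.
# createFolds returns a list of row-index vectors, one per fold.
set.seed(150)
folds = createFolds(train$pitch_type, k = 4)
sapply(folds, length)  # each fold holds roughly a quarter of the training rows
```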
Now, we can train machine learning models to classify pitches. We will initially train 7 classification models:

* rpart (CART) - Recursive Partitioning and Regression Trees
* treebag (Bagged CART) - Bagging
* rf - Random Forest
* C5.0
* lda - Linear Discriminant Analysis
* glmnet - Lasso and Elastic-Net Regularized Generalized Linear Models
* knn - k-Nearest Neighbors
We will train these models and record the accuracy. After the models finish training, we will assess which models are the most accurate and then use them to predict pitch types on the test set.
#Commented for pdf knitting
# set.seed(19)
# rpMod = train(pitch_type~., data = train, method = "rpart", metric = "Accuracy", trControl = kfolds)
#
# set.seed(19)
# tbMod = train(pitch_type~., data = train, method = "treebag", metric = "Accuracy", trControl = kfolds)
#
# set.seed(19)
# rfMod = train(pitch_type~., data = train, method = "rf", metric = "Accuracy", trControl = kfolds)
#
#
# set.seed(19)
# c50Mod = train(pitch_type~., data = train, method = "C5.0", metric = "Accuracy", trControl = kfolds)
#
# set.seed(19)
# ldaMod = train(pitch_type~., data = train, method = "lda", metric = "Accuracy", trControl = kfolds)
#
# set.seed(19)
# gnMod= train(pitch_type~., data = train, method = "glmnet", metric = "Accuracy", trControl = kfolds)
#
# set.seed(19)
# knnMod = train(pitch_type~., data = train, method = "knn", metric = "Accuracy", trControl = kfolds)
#Commmented for pdf knitting
# saveRDS(rpMod, "rpMod.rds")
# saveRDS(tbMod, "tbMod.rds")
# saveRDS(rfMod, "rfMod.rds")
# saveRDS(c50Mod,"c50Mod.rds")
# saveRDS(ldaMod, "ldaMod.rds")
# saveRDS(gnMod, "gnMod.rds")
# saveRDS(knnMod, "knnMod.rds")
#Commented for rmd file
rpMod = readRDS("rpMod.rds")
tbMod = readRDS("tbMod.rds")
rfMod = readRDS("rfMod.rds")
c50Mod = readRDS("c50Mod.rds")
ldaMod = readRDS("ldaMod.rds")
gnMod = readRDS("gnMod.rds")
knnMod = readRDS("knnMod.rds")
Now that the models have trained, we can print the information for each model. Take note of the accuracy for each model (look at the highest accuracy measurement for models with tuning parameters).
print(rpMod)
## CART
##
## 20006 samples
## 10 predictor
## 16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.1325909 0.6543890 0.55798465
## 0.1827023 0.5570297 0.43035614
## 0.2086860 0.3499020 0.05190027
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1325909.
print(tbMod)
## Bagged CART
##
## 20006 samples
## 10 predictor
## 16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8345167 0.7950233
print(rfMod)
## Random Forest
##
## 20006 samples
## 10 predictor
## 16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8420477 0.8040507
## 6 0.8421141 0.8042848
## 10 0.8396816 0.8013407
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
print(c50Mod)
## C5.0
##
## 20006 samples
## 10 predictor
## 16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8112403 0.7659172
## rules FALSE 10 0.8265358 0.7856857
## rules FALSE 20 0.8310341 0.7911977
## rules TRUE 1 0.8119232 0.7668352
## rules TRUE 10 0.8248364 0.7836628
## rules TRUE 20 0.8296847 0.7895917
## tree FALSE 1 0.8049920 0.7586334
## tree FALSE 10 0.8280520 0.7869033
## tree FALSE 20 0.8318838 0.7916633
## tree TRUE 1 0.8054752 0.7592522
## tree TRUE 10 0.8263860 0.7848405
## tree TRUE 20 0.8308010 0.7902961
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
## = FALSE.
print(ldaMod)
## Linear Discriminant Analysis
##
## 20006 samples
## 10 predictor
## 16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7838488 0.7348521
print(gnMod)
## glmnet
##
## 20006 samples
## 10 predictor
## 16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0007010387 0.8094647 0.7629901
## 0.10 0.0070103868 0.7879347 0.7341118
## 0.10 0.0701038676 0.7273113 0.6505621
## 0.55 0.0007010387 0.8099314 0.7635931
## 0.55 0.0070103868 0.7860350 0.7315013
## 0.55 0.0701038676 0.7066481 0.6218244
## 1.00 0.0007010387 0.8098315 0.7635993
## 1.00 0.0070103868 0.7863352 0.7320381
## 1.00 0.0701038676 0.6414594 0.5337931
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.55 and lambda = 0.0007010387.
print(knnMod)
## k-Nearest Neighbors
##
## 20006 samples
## 10 predictor
## 16 classes: 'CH', 'CS', 'CU', 'EP', 'FA', 'FC', 'FF', 'FO', 'FS', 'KC', 'KN', 'PO', 'SI', 'SL', 'ST', 'SV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 15004, 15006, 15005, 15003, 15004, 15005, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8342833 0.7949531
## 7 0.8352663 0.7960385
## 9 0.8347829 0.7953894
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
After printing the model information, here is the ranking of models by cross-validation accuracy (highest to lowest) that I got:

1. Random Forest
2. KNN
3. Bagged CART
4. C5.0
5. glmnet
6. LDA
7. CART
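As an aside, rather than reading each print() by eye, the same ranking can be pulled directly from the cross-validation results with caret's resamples helper (the list labels below are my own):

```r
# Collect the resampling results of all seven trained models for comparison.
results = resamples(list(CART = rpMod, TreeBag = tbMod, RF = rfMod,
                         C5.0 = c50Mod, LDA = ldaMod, GLMnet = gnMod,
                         KNN = knnMod))
summary(results)                      # Accuracy and Kappa per model
bwplot(results, metric = "Accuracy")  # boxplots of accuracy across folds
```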
Now we can evaluate the accuracy of each model’s predictions on the test set. For this, I will use the Random Forest, KNN, and Bagged CART models, and will evaluate the predictions using confusion matrices and accuracy.
rf_pred = predict(rfMod, test)
pitch_vals_rf = union(rf_pred, test$pitch_type)
rf_mat = confusionMatrix(data = factor(rf_pred, levels = pitch_vals_rf), reference = factor(test$pitch_type, levels = pitch_vals_rf))
rf_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction SL FF CH SI FS FC ST CU KC FA SV EP KN
## SL 683 4 2 0 1 92 78 53 5 0 4 0 1
## FF 0 1549 2 56 1 17 0 0 0 0 0 0 0
## CH 1 1 552 23 56 2 0 0 0 0 0 0 1
## SI 0 49 28 749 8 0 0 0 0 0 0 0 0
## FS 6 0 24 0 34 2 0 0 0 0 0 0 0
## FC 53 30 1 0 3 241 1 0 0 0 0 0 0
## ST 69 0 0 0 0 0 124 15 3 0 3 0 0
## CU 25 0 0 0 0 0 10 249 36 0 3 0 1
## KC 1 0 0 0 0 0 1 8 24 0 0 0 1
## FA 0 0 0 0 0 0 0 0 0 0 0 1 0
## SV 0 0 0 0 0 0 0 1 0 0 1 0 0
## EP 0 0 0 0 0 0 0 0 0 0 0 1 0
## KN 0 0 0 0 0 0 0 0 0 0 0 0 0
## FO 0 0 0 0 0 0 0 0 0 0 0 0 0
## Reference
## Prediction FO
## SL 1
## FF 0
## CH 1
## SI 0
## FS 1
## FC 0
## ST 0
## CU 0
## KC 0
## FA 0
## SV 0
## EP 0
## KN 0
## FO 0
##
## Overall Statistics
##
## Accuracy : 0.8426
## 95% CI : (0.8322, 0.8526)
## No Information Rate : 0.3271
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8052
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: SL Class: FF Class: CH Class: SI Class: FS
## Sensitivity 0.8150 0.9486 0.9064 0.9046 0.33010
## Specificity 0.9420 0.9774 0.9806 0.9796 0.99325
## Pos Pred Value 0.7392 0.9532 0.8666 0.8981 0.50746
## Neg Pred Value 0.9619 0.9751 0.9869 0.9810 0.98599
## Prevalence 0.1678 0.3271 0.1220 0.1658 0.02063
## Detection Rate 0.1368 0.3102 0.1106 0.1500 0.00681
## Detection Prevalence 0.1851 0.3255 0.1276 0.1670 0.01342
## Balanced Accuracy 0.8785 0.9630 0.9435 0.9421 0.66167
## Class: FC Class: ST Class: CU Class: KC Class: FA
## Sensitivity 0.68079 0.57944 0.76380 0.352941 NA
## Specificity 0.98103 0.98117 0.98393 0.997766 0.9997997
## Pos Pred Value 0.73252 0.57944 0.76852 0.685714 NA
## Neg Pred Value 0.97577 0.98117 0.98351 0.991125 NA
## Prevalence 0.07090 0.04286 0.06529 0.013619 0.0000000
## Detection Rate 0.04827 0.02483 0.04987 0.004807 0.0000000
## Detection Prevalence 0.06589 0.04286 0.06489 0.007010 0.0002003
## Balanced Accuracy 0.83091 0.78030 0.87387 0.675354 NA
## Class: SV Class: EP Class: KN Class: FO
## Sensitivity 0.0909091 0.5000000 0.0000000 0.0000000
## Specificity 0.9997993 1.0000000 1.0000000 1.0000000
## Pos Pred Value 0.5000000 1.0000000 NaN NaN
## Neg Pred Value 0.9979964 0.9997997 0.9991989 0.9993992
## Prevalence 0.0022031 0.0004006 0.0008011 0.0006008
## Detection Rate 0.0002003 0.0002003 0.0000000 0.0000000
## Detection Prevalence 0.0004006 0.0002003 0.0000000 0.0000000
## Balanced Accuracy 0.5453542 0.7500000 0.5000000 0.5000000
knn_pred = predict(knnMod, test)
pitch_vals_knn = union(knn_pred, test$pitch_type)
knn_mat = confusionMatrix(data = factor(knn_pred, levels = pitch_vals_knn), reference = factor(test$pitch_type, levels = pitch_vals_knn))
knn_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction SL FF CH SI FS FC ST CU KC FA SV EP KN
## SL 665 2 6 0 4 98 68 47 3 0 4 0 2
## FF 1 1555 3 63 2 28 0 0 0 0 0 0 0
## CH 5 0 528 17 52 6 0 0 0 0 0 0 0
## SI 0 43 40 747 11 0 0 0 0 0 0 0 0
## FS 3 1 32 1 33 1 0 0 0 0 0 0 0
## FC 62 32 0 0 1 221 0 0 0 0 0 0 0
## ST 69 0 0 0 0 0 131 14 1 0 2 0 0
## CU 30 0 0 0 0 0 11 253 32 0 2 0 1
## KC 2 0 0 0 0 0 3 11 32 0 1 0 1
## FA 0 0 0 0 0 0 0 0 0 0 0 1 0
## SV 1 0 0 0 0 0 1 1 0 0 2 0 0
## EP 0 0 0 0 0 0 0 0 0 0 0 1 0
## KN 0 0 0 0 0 0 0 0 0 0 0 0 0
## FO 0 0 0 0 0 0 0 0 0 0 0 0 0
## Reference
## Prediction FO
## SL 1
## FF 0
## CH 1
## SI 0
## FS 1
## FC 0
## ST 0
## CU 0
## KC 0
## FA 0
## SV 0
## EP 0
## KN 0
## FO 0
##
## Overall Statistics
##
## Accuracy : 0.8348
## 95% CI : (0.8242, 0.845)
## No Information Rate : 0.3271
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7954
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: SL Class: FF Class: CH Class: SI Class: FS
## Sensitivity 0.7936 0.9522 0.8670 0.9022 0.320388
## Specificity 0.9434 0.9711 0.9815 0.9774 0.992025
## Pos Pred Value 0.7389 0.9413 0.8670 0.8882 0.458333
## Neg Pred Value 0.9577 0.9767 0.9815 0.9805 0.985775
## Prevalence 0.1678 0.3271 0.1220 0.1658 0.020629
## Detection Rate 0.1332 0.3114 0.1057 0.1496 0.006609
## Detection Prevalence 0.1803 0.3309 0.1220 0.1684 0.014420
## Balanced Accuracy 0.8685 0.9617 0.9243 0.9398 0.656206
## Class: FC Class: ST Class: CU Class: KC Class: FA
## Sensitivity 0.62429 0.61215 0.77607 0.470588 NA
## Specificity 0.97952 0.98200 0.98372 0.996345 0.9997997
## Pos Pred Value 0.69937 0.60369 0.76900 0.640000 NA
## Neg Pred Value 0.97156 0.98262 0.98435 0.992717 NA
## Prevalence 0.07090 0.04286 0.06529 0.013619 0.0000000
## Detection Rate 0.04426 0.02624 0.05067 0.006409 0.0000000
## Detection Prevalence 0.06329 0.04346 0.06589 0.010014 0.0002003
## Balanced Accuracy 0.80191 0.79708 0.87989 0.733467 NA
## Class: SV Class: EP Class: KN Class: FO
## Sensitivity 0.1818182 0.5000000 0.0000000 0.0000000
## Specificity 0.9993978 1.0000000 1.0000000 1.0000000
## Pos Pred Value 0.4000000 1.0000000 NaN NaN
## Neg Pred Value 0.9981957 0.9997997 0.9991989 0.9993992
## Prevalence 0.0022031 0.0004006 0.0008011 0.0006008
## Detection Rate 0.0004006 0.0002003 0.0000000 0.0000000
## Detection Prevalence 0.0010014 0.0002003 0.0000000 0.0000000
## Balanced Accuracy 0.5906080 0.7500000 0.5000000 0.5000000
tb_pred = predict(tbMod, test)
pitch_vals_tb = union(tb_pred, test$pitch_type)
tb_mat = confusionMatrix(data = factor(tb_pred, levels = pitch_vals_tb), reference = factor(test$pitch_type, levels = pitch_vals_tb))
tb_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction SL FF CH SI FS FC ST CU SV KC FA EP KN
## SL 677 1 5 0 4 92 75 57 4 4 0 0 1
## FF 1 1546 4 61 1 23 0 0 0 0 0 0 0
## CH 2 2 548 28 60 1 0 0 0 0 0 0 1
## SI 0 53 28 737 7 0 0 0 0 0 0 0 0
## FS 3 1 23 1 31 2 0 0 0 0 0 0 0
## FC 63 30 1 0 0 236 0 0 0 0 0 0 0
## ST 65 0 0 0 0 0 126 16 5 2 0 0 0
## CU 26 0 0 0 0 0 11 242 1 36 0 0 1
## SV 0 0 0 0 0 0 0 1 1 0 0 0 0
## KC 1 0 0 1 0 0 2 10 0 26 0 0 1
## FA 0 0 0 0 0 0 0 0 0 0 0 1 0
## EP 0 0 0 0 0 0 0 0 0 0 0 1 0
## KN 0 0 0 0 0 0 0 0 0 0 0 0 0
## FO 0 0 0 0 0 0 0 0 0 0 0 0 0
## Reference
## Prediction FO
## SL 1
## FF 0
## CH 1
## SI 0
## FS 1
## FC 0
## ST 0
## CU 0
## SV 0
## KC 0
## FA 0
## EP 0
## KN 0
## FO 0
##
## Overall Statistics
##
## Accuracy : 0.8354
## 95% CI : (0.8248, 0.8456)
## No Information Rate : 0.3271
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7962
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: SL Class: FF Class: CH Class: SI Class: FS
## Sensitivity 0.8079 0.9467 0.8998 0.8901 0.300971
## Specificity 0.9413 0.9732 0.9783 0.9789 0.993661
## Pos Pred Value 0.7351 0.9450 0.8523 0.8933 0.500000
## Neg Pred Value 0.9605 0.9741 0.9860 0.9782 0.985398
## Prevalence 0.1678 0.3271 0.1220 0.1658 0.020629
## Detection Rate 0.1356 0.3096 0.1098 0.1476 0.006209
## Detection Prevalence 0.1845 0.3277 0.1288 0.1652 0.012417
## Balanced Accuracy 0.8746 0.9600 0.9391 0.9345 0.647316
## Class: FC Class: ST Class: CU Class: SV Class: KC
## Sensitivity 0.66667 0.58879 0.74233 0.0909091 0.382353
## Specificity 0.97974 0.98159 0.98393 0.9997993 0.996954
## Pos Pred Value 0.71515 0.58879 0.76341 0.5000000 0.634146
## Neg Pred Value 0.97469 0.98159 0.98204 0.9979964 0.991519
## Prevalence 0.07090 0.04286 0.06529 0.0022031 0.013619
## Detection Rate 0.04727 0.02524 0.04847 0.0002003 0.005207
## Detection Prevalence 0.06609 0.04286 0.06349 0.0004006 0.008211
## Balanced Accuracy 0.82320 0.78519 0.86313 0.5453542 0.689654
## Class: FA Class: EP Class: KN Class: FO
## Sensitivity NA 0.5000000 0.0000000 0.0000000
## Specificity 0.9997997 1.0000000 1.0000000 1.0000000
## Pos Pred Value NA 1.0000000 NaN NaN
## Neg Pred Value NA 0.9997997 0.9991989 0.9993992
## Prevalence 0.0000000 0.0004006 0.0008011 0.0006008
## Detection Rate 0.0000000 0.0002003 0.0000000 0.0000000
## Detection Prevalence 0.0002003 0.0002003 0.0000000 0.0000000
## Balanced Accuracy NA 0.7500000 0.5000000 0.5000000
After making predictions on the test data with our models, we see that the accuracy of each model remains very close to the accuracy we got from k-fold cross-validation on the training set.

Looking at the confusion matrices, there are a few trends that we can observe. One is that the models are less accurate when classifying sweepers (ST). The sweeper was popularized across baseball this year, and it most closely resembles a slider, but with more lateral movement. The models also struggled to classify knuckle curveballs (KC), which are a variation of the curveball. The models could potentially get better at classifying these pitch variations with larger samples or with models tailored to individual pitchers. Another point to note is that the models had a hard time classifying forkballs (FO), eephuses (EP), slurves (SV), and knuckleballs (KN). These pitches are very rare, so a larger sample size would also be necessary to improve accuracy on them.

Despite this, the models performed fairly well overall. The accuracies of the models selected for testing ranged from roughly 0.83 to 0.84, and the most accurate model was the Random Forest. I would say that the models are generally good classifiers of pitch type. I hope to experiment more with using machine learning to classify pitches in the future; hopefully, I will be able to improve the accuracies of the models and learn more about classification along the way.
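As one possible next step, caret's varImp function can show which inputs the Random Forest relied on most. This is a quick diagnostic sketch, not something I ran as part of the analysis above.

```r
# Variable importance for the trained Random Forest model.
rf_imp = varImp(rfMod)
print(rf_imp)
plot(rf_imp, top = 10)  # plot the ten most important predictors
```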